Information
Assignments due this week
Office hours
Whether you’re encrypting an email, sending a message on your mobile phone, connecting to an HTTPS website, or connecting to a remote machine through IPSec or SSH, there’s a hash function somewhere under the hood. Hash functions are by far the most versatile and ubiquitous of all crypto algorithms.
– Jean-Philippe Aumasson, Serious Cryptography (Chapter 6)
By the end of today's lecture, you will learn:
There are five parts to this course.
Last time, we saw an introduction to cryptography, and an introduction to cryptocurrencies.
Our starting point into cryptocurrencies is to build a digital version of a banking system, where people can transfer money to each other electronically over the Internet without the need for a single trusted banking authority.
In the physical world, when Bob receives a check from Alice, he may be suspicious that Alice is trying to deceive him. He needs to be able to verify the following three properties.
Alice properly signed the check
Alice possesses $1 in her bank account
Alice does not double spend the money by writing a check to someone else at the same time
Last week, we started to explore a cryptographic primitive that will address property #1: building a digital version of signing a check. This primitive is (appropriately enough) called a digital signature.
Today, we will study a special function that is needed to build a digital signature and other crypto tools.
This special function is called a cryptographic hash function, and we will have many uses for it throughout the semester.
Let's start by exploring how computers represent data. Throughout this course, we will typically represent data as abstractly as possible: as a sequence of bits of a particular length.
print("byte bits")
for x in range(9): print("{:4d}".format(x), " {0:08b}".format(x))
print(" ...")
for x in range(248,257): print("{:4d}".format(x), " {0:08b}".format(x))
byte bits 0 0b0 1 0b1 2 0b10 3 0b11 4 0b100 5 0b101 6 0b110 7 0b111 8 0b1000 ... 248 11111000 249 11111001 250 11111010 251 11111011 252 11111100 253 11111101 254 11111110 255 11111111 256 100000000
A bitstring is an arbitrary-length sequence of bits. The set of all bitstrings of length $n$ is denoted as $\{0, 1\}^n$. The set of bitstrings of any possible length is denoted as $\{0, 1\}^*$.
(We will typically, though not always, consider bitstrings that are a multiple of 8 bits; i.e., that can be decomposed into bytes with no "leftover" bits.)
Within Python, you can represent a value of type byte using a string with the letter b in front of it. For example, here is a byte that holds the ASCII value of the character A.
s = b'A'
print(s)
type(s)
b'A'
bytes
Some (but not all!) bytes correspond to printable characters in the American character encoding standard, or ASCII.
print("byte ASCII character")
for x in range(4): print("{:4d}".format(x), chr(x))
print(" ...")
for x in range(65,137): print("{:4d}".format(x), chr(x))
print(" ...")
for x in range(251,256): print("{:4d}".format(x), chr(x))
byte ASCII character
0
1
2
3
...
65 A
66 B
67 C
68 D
69 E
70 F
71 G
72 H
73 I
74 J
75 K
76 L
77 M
78 N
79 O
80 P
81 Q
82 R
83 S
84 T
85 U
86 V
87 W
88 X
89 Y
90 Z
91 [
92 \
93 ]
94 ^
95 _
96 `
97 a
98 b
99 c
100 d
101 e
102 f
103 g
104 h
105 i
106 j
107 k
108 l
109 m
110 n
111 o
112 p
113 q
114 r
115 s
116 t
117 u
118 v
119 w
120 x
121 y
122 z
123 {
124 |
125 }
126 ~
127
128
129
130
131
132
133
134
135
136
...
251 û
252 ü
253 ý
254 þ
255 ÿ
Python natively represents numbers in the following formats:
| Data type | Base | # bits encoded in each symbol |
|---|---|---|
bin |
2 | 1 |
int |
10 | about 3.32 |
hex |
16 | 4 |
The hexadecimal, or base 16, representation of a number uses the symbols 0-9 and a-f to denote the values 0 through 15.
By convention, Python places a 0x prefix in front of a number to remind you that it is written in hex format.
print("int hex")
for x in range(16):
print("{:3d}".format(x), " ", hex(x))
print("...")
for x in range(240,256):
print("{:3d}".format(x), " ", hex(x))
int hex 0 0x0 1 0x1 2 0x2 3 0x3 4 0x4 5 0x5 6 0x6 7 0x7 8 0x8 9 0x9 10 0xa 11 0xb 12 0xc 13 0xd 14 0xe 15 0xf ... 240 0xf0 241 0xf1 242 0xf2 243 0xf3 244 0xf4 245 0xf5 246 0xf6 247 0xf7 248 0xf8 249 0xf9 250 0xfa 251 0xfb 252 0xfc 253 0xfd 254 0xfe 255 0xff
The Python libraries binascii and PyCrypto provide commands to convert between raw bytes and numbers in bin, int, or hex formats.
from Crypto.Util.number import bytes_to_long, long_to_bytes # requires PyCrypto
from binascii import hexlify, unhexlify
t = bytes_to_long(s)
print(t)
print(type(t))
65 <class 'int'>
u = hexlify(s)
print(u)
print(type(u))
b'41' <class 'bytes'>
v = bin(t)
print(v)
print(type(v))
0b1000001 <class 'str'>
Note that Python uses the prefix 0b to denote a number in binary format.
Let's rewind time and imagine that the date is currently November 30, 2007.
At that time, here were some of the candidates who were considered as strong candidates for the 2008 U.S. presidential election.
| John Edwards | Al Gore | John McCain | Barack Obama | Mitt Romney |
|---|---|---|---|---|
[Images from Wikipedia]
Here's the scenario:
[Source: Nostradamus website.]
There is a clear problem here: if Alice won't reveal the image $x$ until after November 2008, then how does Bob know that she knew the answer beforehand?
Question. Can we design a way to make this bet that provides:
What would be great is if there was a function $H$ that we could apply to the image $x$ that:
If we had such a file, then
Bob can then compute $H(x)$ himself and test whether it equals the fingerprint $y$ that he received back in 2007. If so, then he knows that Alice did have that image in mind back in 2007.
Often called a cryptographer’s Swiss Army knife, a hash function can underlie many different cryptographic schemes: aside from producing a document’s digest to be digitally signed—one of the most common applications.
Let us give example applications: code-signing systems, computer forensics, [and protecting] passwords.
– Aumasson, Meier, Phan, and Henzen, The Hash Function BLAKE.
A cryptographic hash function is a mathematical function, which we typically denote by $H$, that has two seemingly-contradictory requirements.
You can think of a hash function as a gigantic function that translates between two languages:
Moreover, the outputs have a fixed, known length, no matter how short or long the input messages are.
| Input | Output |
|---|---|
| abandon | nieh |
| ability | btol |
| able | ykwf |
| about | fkay |
| $\vdots$ | $\vdots$ |
| zipper | jgxm |
| zone | pnjo |
| zoo | ansm |
[Note: this truth table was produced by mapping the top 3000 English words to random 4-letter strings using random.org.]
While this toy example used English-language words as the possible inputs, in reality we want a hash function to accept any data type. To be as generic as possible, we will consider the inputs and outputs to be bitstrings.
As a result, we can apply $H$ to Alice's image file from our election example.
Using the language of bitstrings, we can now define a hash $H$ as we stated previously: it's a single mathematical function that can accept inputs of any size and that always returns an output of a fixed length.
Definition. A hash function $H: \{0, 1\}^* \to \{0, 1\}^n$ is an efficiently-computable function that
Note that there is nothing hidden or secret here: the code of $H$ is public knowledge and anyone can compute it.
The output $H(x)$ is often called the hash digest corresponding to the input $x$.
Because $H$ has infinitely many inputs and only finitely many outputs, it must be the case that there are collisions -- that is, two inputs that map to the same output string.
But: we are going to insist that collisions are computationally infeasible to find, just like they would be in our motivating example of the randomly-generated foreign language.
This guarantee should hold even if we give the code of $H$ to our adversary.
Here is our adversary in this course. Her name is Mallory, because she is malicious and doesn't follow the rules of any crypto protocol.
Mallory is intended to be a stand-in for any kind of real-world adversary who you might be concerned about: your internet service provider, cloud provider, government, or anyone else who might want to ruin your day.
Mallory has a substantial (though not infinite) amount of computing power at her disposal.
She's also very smart. Specifically, we must assume that she is a better cryptanalyst than we are. This is to account for Schneier's law.
"Anyone, from the most clueless amateur to the best cryptographer, can create an algorithm that he himself can't break."
– Bruce Schneier
Even though Mallory is mathematically savvy and computationally powerful, we insist that she must still not be able to "break" our hash function $H$.
To make this claim mathematically precise, we need to specify precisely the problem that Mallory cannot break.
In fact we will define multiple such problems. As we go down the list:
Definition. A cryptographic hash function $H$ satisfies the following properties.
One wayness, aka preimage resistance: if we sample $y \gets \{0, 1\}^n$ uniformly at random from the set of all bitstrings of length $n$, then it is practically infeasible for Mallory to find any preimage $x$ such that $H(x) = y$.
(Note: remember that there are infinitely many such preimages, yet Mallory cannot find even a single one!)
Second preimage resistance: given a "randomly" chosen $x$, it is practically infeasible for Mallory to find another $x'$ such that $H(x) = H(x')$ but $x \neq x'$.
(Note: this statement is not mathematically precise, since it isn't possible to sample uniformly at random from an infinite space like $\{0, 1\}^*$. But I will ignore this issue since we will not focus on second preimage resistance in this course.)
Collision resistance: it is practically infeasible for Mallory to find two different inputs $x_1$ and $x_2$ such that $H(x) = H(x')$.
(Note: if a function is collision resistant then it must be one way, but the converse is not necessarily true. Do you see why that is?)
If an adversary [Mallory] has not explicitly queried the oracle on some point $x$, then the value of $H(x)$ is completely random... at least as far as [Mallory] is concerned.
– Jon Katz and Yehuda Lindell, Introduction to Modern Cryptography
Note that there is a limit to how infeasible it is to find a collision. After all, we already said that collisions must exist. And the output space is finite, so if we have enough time and computing power, it must be possible to find a collision.
Exactly how difficult can we hope that the problem might be? That is, even with the best possible hash function, what is a blunt and brute force attack that will always succeed?
Let's step away from hash functions for a moment, and look at a seemingly-unrelated question. Let $K$ be the number of people in this classroom.
Question. What is the probability that two people in the room have the same birthday?
(For the purpose of this question, suppose that everyone's birthday is sampled uniformly at random from the set of 366 possible choices. This is not a valid assumption for many reasons, e.g., leap years, but let's go with it for now.)
It might intuitively seem like the answer should be $\frac{K}{366}$.
But that is incorrect! It would be the answer to a different problem:
What is the probability that at least one of $N$ students in this classroom has the same birthday as Prof. Varia does?
The distinction between these two questions (and their corresponding answers) is called the birthday paradox.
The answer to the original question depends on the number of pairs of people in the classroom.
So the answer to the question is closer to ${K \choose 2} / 366 \approx K^2 / (2 \cdot 366)$ than it is to $K / 366$.
But ${K \choose 2} / 366$ is not exactly right either.
The reason is the principle of inclusion-exclusion: we would be overcounting in scenarios where many people have the same birthday.
Instead, it is easier to think about the converse problem. Suppose $K = 23$, and let's imagine that the $K$ people walk into the classroom, one at a time.
What is the probability that all of them have different birthdays?
As a result, the overall probability that everybody's birthday is distinct is small:
$$ 1 \times \frac{365}{366} \times \frac{364}{366} \times \cdots \times \frac{344}{366}$$import math
math.prod(range(344,367)) / (366**23)
0.49367698818054007
As a result, if there are 23 people in the classroom, then it is more likely than not that two of us have the same birthday.
Here's a graph that shows the probability of two people having the same birthday in a room of $N$ people.
Generalizing from the specific case of birthdays (where there are $N = 366$ options): when drawing with replacement from a set of size $N$, then:
$$ E[\text{# of attempts until the first collision}] \approx 1.25 \sqrt{N}.$$This approximation works well even for fairly small choices of $N$, such as $N = 366$. And it works really well for larger $N$.
import math
1.25 * math.sqrt(366)
23.91390808713624
Moreover, the distribution is tightly correlated around this expected value. There is less than a 15% chance that:
Returning back to hash functions: the key insight from the birthday problem is that:
The best-possible security that we can hope for a hash function $H: \{0, 1\}^* \to \{0, 1\}^n$ to achieve is related to the length of the output space $n$.
Example. Suppose that we build a hash function whose output length is $n = 3$ bits. Let's try to break one-wayness and collision resistance using a simple guess-and-check approach.
To break one-wayness for (say) $y = 000$, suppose we just select inputs $x$ at random and compute the hash function until we find an input where $H(x) = y$.
Now let's try to break collision resistance. Once again, suppose that we aimlessly select inputs $x$ at random and build a table of all $(x, H(x))$ pairs until we find two inputs with the same digest.
So in order to have a secure hash function, it is necessary (though not sufficient) to choose a large output length $n$. A common choice in cryptography is to choose $n = 256$ bits.
Extrapolating from the analysis above, our brute-force attack succeeds at breaking:
These are both big numbers. And they're very different scales of big-ness. Remember that $2^{256}$ is not two times as big as $2^{128}$. It is instead $2^{128}$ times as big!
In fact, both of them are so big as to be practically impossible for anyone to execute, with the computing power available today or in the near future.
To give a sense of how big the number $2^{128}$ is:
To compute the hash function $2^{128}$ times, it would take the entire world (with our current computing power) about $2^{128} / 2^{70} = 2^{58}$ seconds.
(Admittedly, this calculation discounts the effect of Moore's law. But hopefully it gives you a sense of the vast scale of computing power involved here.)
The number $2^{256}$ is much, much larger than this. To give you a sense of its scale: suppose that Mallory decided to try running the hash function on $2^{256}$ inputs, say by running a for loop to try all inputs from
from binascii import hexlify
print(hexlify(32 * b'\x00')) # taking 32 copies of the byte value 0 (printed in hexadecimal format)
b'0000000000000000000000000000000000000000000000000000000000000000'
to this:
print(hexlify(32 * b'\xff')) # taking 32 copies of the byte value 255 (printed in hexadecimal format)
b'ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff'
Then:
for loop on a modern computer would require more time than the expected remaining length of the universe.So far, we have talked about the best-possible security that a hash function can hope to achieve.
But is this hope actually achievable?
We believe the answer is yes!
Unfortunately, I cannot analytically prove to you that any cryptographic hash function actually meets the definition above; that would require fundamental breakthroughs in computer science like proving $\textit{P} \neq \textit{NP}$.
So instead, we select hash functions empirically, based on how well they hold up against the worldwide cryptanalysis community.
The U.S. National Institute of Standards and Technology (NIST) develops standards for all cryptographic building blocks: hash functions, digital signatures, encryption schemes, and more.
To date, NIST has developed three standards for hash functions (and also subsequently revoked one of them). These algorithms are called the Secure Hash Algorithms, or SHA for short. There have been three iterations of hash functions.
SHA-1 was initially developed in 1995 by the U.S. National Security Agency. It had an output length of $n = 160$ bits, so by the birthday bound at best we can hope that it takes $2^{80}$ time to find a collision.
But through a combination of incredibly clever math to design a more clever collision-finding algorithm and a lot of computing power to execute it, researchers have broken SHA-1 and you should never use it.
Example. Here are two PDF files.
| good.pdf | bad.pdf |
|---|---|
It turns out that they have the same hash digest using the SHA-1 hash function.
from hashlib import sha1
# read the contents of two files: good.pdf (a check mark) and bad.pdf (an X)
with open('images/good.pdf', 'rb') as file:
good = file.read()
with open('images/bad.pdf', 'rb') as file:
bad = file.read()
# verify that the strings are different
assert(good != bad)
# outputs are 40 hex characters, or 160 bits in length
hashGood = sha1(good).hexdigest()
hashBad = sha1(bad).hexdigest()
# observe that the hash function outputs are the same
print(hashGood)
print(hashBad)
assert(hashGood == hashBad)
d00bbe65d80f6d53d5c15da7c6b4f0a655c5a86a d00bbe65d80f6d53d5c15da7c6b4f0a655c5a86a
So which hash functions should you use?
SHA-2 was initially developed in 2001, also by the U.S. National Security Agency. It is the most popular hash function algorithm today. It offers a choice of four possible output lengths: 224, 256, 384, or 512 bits.
SHA-3 was standardized in 2015 after a multi-year open competition. Its goal was to design a hash function in a fundamentally different way as SHA-2, in order to have a "failsafe" plan in case someone finds a collision in SHA-2.
To the best of our collective knowledge as a crypto community, for both of these functions are secure.
In particular, the best known algorithms to break collision resistance require about $2^{n/2}$ work to break, when the hash function is set to have output length $n$.
from Crypto.Hash import SHA256
# output is 64 hex characters, or 256 bits in length
print(SHA256.new(b'Hello world!').hexdigest())
c0535e4be2b79ffd93291305436bf889314e4a3faec05ecffcbb7df31ad9e51a
We will see many uses of hash functions throughout this course. For example, next time we will see how hash functions can be used within the construction of digital signatures.
For today, let's go back to the fingerprinting application from the beginning of the lecture. Suppose that Alice wants to store a large file $f$ on a cloud service like Dropbox or Google Drive. The file is important to Alice, so when she retrieves it later, she wants to be able to check that the file hasn't been altered or corrupted on the cloud.
How can Alice do this?
Here's one common approach:
Compute the hash of the file $h = \texttt{SHA256}(f)$ before you upload the file $f$ to the cloud. Store $h$ on your own computer. (Remember that $h$ is small, only 32 bytes in size.)
When you later download a file $f'$, check that $\texttt{SHA256}(f') \overset{?}{=} h$ in order to verify that $f'$ is indeed the same as the file $f$ that you previously uploaded.
This approach is both simple and secure.
Modern hash functions are very quick to compute, even on large files that are gigabytes in size.
Due to collision resistance, even if Mallory is running the cloud provider, she cannot find a new file $f'$ that is different from Alice's original file $f$ and yet still hashes to the same string $h$.
What if Alice wants to upload many files $f_1, f_2, \ldots, f_k$ to the cloud?
Fortunately, there is another approach that still allows Alice to verify one file at a time. Before uploading the files, Alice must construct a Merkle tree as follows.
(This data structure is named after cryptographer Ralph Merkle.)
Source: Alin Tomescu, Decentralized Thoughts blog
The nice part about Merkle trees is that:
To construct a proof that (for example) a claimed file $f_3^*$ is in the set, the cloud provider Mallory sends the following $\log(k)$ hash digests to Alice.
Source: Alin Tomescu, Decentralized Thoughts blog
I put an asterisk on the received file $f_3^*$ and the received hash digests $h_4^*, h_{1,2}^*, h_{5,8}^*$ because Alice does not yet know that the cloud provider Mallory is telling the truth. That is, these might potentially be different than the variables that Alice used when she initially constructed the Merkle tree.
To test whether the received file $f_3^*$ is correct, Alice can "fill in the blanks" of the path of the Merkle tree from the leaf up to the root.
$$\begin{align} h_{3}^* &= H(f_3^*) \\ h_{3,4}^* &= H(h_3^*,h_4^*) \\ h_{1,4}^* &= H(h_{1,2}^*,h_{3, 4}^*) \\ h_{1,8}^* &= H(h_{1,4}^*,h_{5,8}^*) \\ \end{align}$$Remember that the real file $f_3$ contributed toward $\log(n)$ hash digests. For instance, if we "trace" the path of the file $f_3$, here are all of the variables that it influenced.
$$\begin{align*} h_{3} &= H(f_3) \\ h_{3,4} &= H(h_3,h_4) \\ h_{1,4} &= H(h_{1,2},h_{3, 4}) \\ h_{1,8} &= H(h_{1,4},h_{5,8}) \\ \end{align*}$$Because Alice has stored $h_{1,8}$, she can test whether $h_{1,8}^* = h_{1,8}$. If this equality holds:
From the final equation: it must be true that $h_{1,4}^* = h_{1,4}$ and $h_{5,8}^* = h_{5,8}$. Otherwise, $H(h_{1,4},h_{5,8}) = H(h_{1,4}^*,h_{5,8}^*)$ so Mallory has broken collision resistance.
From the second-to-last equation: it must also be true that $h_{1,2}^* = h_{1,2}$ and $h_{3,4}^* = h_{3,4}$. Otherwise, $H(h_{1,2},h_{3,4}) = H(h_{1,2}^*,h_{3,4}^*)$ so Mallory has broken collision resistance.
...and so on, until eventually the first equation tells us that $f_3 = f_3^*$, as desired!
In this way: hash functions provide a succinct yet tamper-evident way to represent large volumes of data.